Transformer model
The core architecture behind modern LLMs. It is built around the Attention mechanism; a minimal sketch appears after the links below.
- http://jalammar.github.io/illustrated-transformer/
- https://www.youtube.com/watch?v=-QH8fRhqFHM : GPT is decoder-only (generation); BERT is encoder-only (representation); the original encoder-decoder transformer targeted translation.
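A minimal sketch of scaled dot-product attention in NumPy; the `causal` flag reproduces the decoder-style masking that separates GPT-style generation from BERT-style encoding. The toy shapes and input are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)  # (seq, seq) similarities
    if causal:
        # decoder-style mask: position i may only attend to positions <= i
        mask = np.triu(np.ones(scores.shape[-2:], dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)
    return softmax(scores) @ V

# toy self-attention check: 4 tokens, model dimension 8
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x, causal=True)  # GPT-style
print(out.shape)  # (4, 8)
```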
Google’s T5 paper frames every task as text-to-text, giving a unified framework for understanding and training transformer models.
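As a small illustration of the text-to-text framing, the pretrained t5-small checkpoint on Hugging Face accepts task prefixes directly (the prefix follows the paper's public examples; requires `transformers` and `sentencepiece`):

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# every task is encoded as plain text with a task prefix
ids = tok("translate English to German: The house is wonderful.",
          return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))
```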
Tutorials and reviews
- A walkthrough of transformer architecture code by Mark Riedl
- Transformers from scratch
- “Attention”, “Transformers”, in Neural Network “Large Language Models” by Cosma Shalizi
- Understanding Encoder And Decoder LLMs by Sebastian Raschka
- The Illustrated Transformer by Jay Alammar
- Transformer explainer
- The Attention Mechanism in Large Language Models by Luis Serrano
Implementations
See also Implementations
https://huggingface.co/blog/how-to-train shows how to train a transformer model from scratch; a minimal sketch follows below. See also How to pretrain transformer models, or A complete Hugging Face tutorial: how to build and train a vision transformer.
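A compressed sketch of that recipe, assuming the Hugging Face `transformers` and `datasets` libraries; the tiny GPT-2 config, dataset choice, and hyperparameters are placeholders, not the blog post's exact values.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          GPT2Config, GPT2LMHeadModel, Trainer, TrainingArguments)

# reuse an existing tokenizer (the post also covers training one from scratch)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# a deliberately tiny, randomly initialized GPT-2-style model (no pretrained weights)
config = GPT2Config(vocab_size=len(tokenizer), n_layer=4, n_head=4, n_embd=256)
model = GPT2LMHeadModel(config)

ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
ds = ds.map(lambda b: tokenizer(b["text"], truncation=True, max_length=128),
            batched=True, remove_columns=["text"])
ds = ds.filter(lambda ex: len(ex["input_ids"]) > 1)  # drop empty lines

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal LM objective
args = TrainingArguments(output_dir="tiny-gpt2",
                         per_device_train_batch_size=8,
                         num_train_epochs=1)
Trainer(model=model, args=args, train_dataset=ds, data_collator=collator).train()
```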
- gpt-fast: https://github.com/pytorch-labs/gpt-fast
- The Genius of DeepSeek’s 57X Efficiency Boost [MLA] (video on Multi-head Latent Attention)
Applications
Transformers are also used outside language modeling, including in Computer vision (Vision transformer) and Reinforcement learning (Decision transformer).
Internal workings
See Sanford2024transformers for the connection to Massively parallel computation.
Teh2025solving studies whether transformers can solve an empirical Bayes problem.
Cohen2025spectral studies how transformer models can predict the Shortest path on a graph.
Circuit analysis
Park2025does identifies temporal heads by performing circuit analysis.
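Circuit analysis of this kind starts from per-head activation and attention patterns. A generic starting point, not Park2025does's actual pipeline: dump GPT-2's attention maps via `output_attentions=True` and inspect individual heads (the layer/head indices below are illustrative).

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2", output_attentions=True)
model.eval()

ids = tok("In 1999, the tallest building was", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(ids)

# out.attentions: one tensor per layer, each (batch, n_head, seq, seq)
layer, head = 5, 1  # which head to inspect; purely illustrative
attn = out.attentions[layer][0, head]
print(attn.shape)  # (seq, seq): row i = where token i attends
```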